Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36311: [C++] Fix integer overflows in utf8_slice_codeunits #36575

Merged
merged 5 commits into from
Jul 11, 2023

Conversation

benibus
Copy link
Collaborator

@benibus benibus commented Jul 9, 2023

Rationale for this change

The default value for the SliceOptions::stop is INT64_MAX, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when stop isn't provided.

Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least).

What changes are included in this PR?

  • Adds some logic to SliceCodunitsTransform that handles potential overflows
  • Adds tests for cases where the start param is positive/negative and stop is the maximum value

Update
Discovered that utf8_slice_codeunits deviates from Python array behavior when stop=None and step < 0, so further changes were made:

  • Handles INT64_MIN for SliceOptions::stop on C++ side, adds more tests.
  • Updates Python bindings for SliceOptions so that the default value when stop=None (sys.maxsize) is negated when step < 0
  • Adds None as a possible stop value in Python tests

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

In theory, altering the behavior of utf8_slice_codepoints when stop=None and step < 0 could be considered a breaking change. That being said, the current implementation produces incorrect results whenever None is even used, so it probably isn't one in practice...

@benibus benibus marked this pull request as ready for review July 9, 2023 17:25
@benibus
Copy link
Collaborator Author

benibus commented Jul 9, 2023

@pitrou I haven't added any python tests (although I can if necessary) - but the original examples should be working now as well.

>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
<pyarrow.StringScalar: '🎭D'>
>>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
<pyarrow.lib.StringArray object at 0x7f0fa4bb85e0>
[
  "AB🎭C🎭ㇱD"
]

@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

The current Python tests already exercise the utf8_slice_codeunits function against Python slicing, so it would be nice to expand them a bit to also cover the case where stop is unbounded:

def test_slice_compatibility():

@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

This PR is almost ready for merging and would be nice to include in 13.0.0. @raulcd

@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

@github-actions crossbow submit -g cpp

@github-actions

This comment was marked as outdated.

@benibus
Copy link
Collaborator Author

benibus commented Jul 10, 2023

This may need more work, unfortunately. I've learned that passing stop=None doesn't produce the same results as it would for python arrays if step is also negative. For instance:

>>> s = pa.scalar("𝑓öõḍš")
>>>
>>> pc.utf8_slice_codeunits(s, start=2, stop=None, step=-1)
<pyarrow.StringScalar: ''>
>>>
>>> s.as_py()[2:None:-1]
'õö𝑓'

The current behavior is actually equivalent to this:

>>> s.as_py()[2:sys.maxsize:-1]
''

So yeah... expanding the python tests was a good call.

When `stop=None`, negates the default value of `sys.maxsize` if `step <
0` to mimic the behavior of Python arrays
@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

@github-actions crossbow submit -g cpp

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 10, 2023
@github-actions

This comment was marked as outdated.

@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: 58eb873

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f1abc3b6c

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions

@wirable23
Copy link

Thank you for fixing this @benibus! I think this also fixes few other open issues:

#34917
#34928
#14991
#34929

@pitrou
Copy link
Member

pitrou commented Jul 11, 2023

@raulcd It would be nice to cherry-pick this in 13.0.0.

raulcd pushed a commit that referenced this pull request Jul 11, 2023
### Rationale for this change

The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided.

Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least).

### What changes are included in this PR?

- Adds some logic to `SliceCodunitsTransform` that handles potential overflows
- Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value

**Update**
Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made:
- Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests.
- Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0`
- Adds `None` as a possible `stop` value in Python tests

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice...

* Closes: #36311

Authored-by: benibus <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 143c691.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@rachtsingh
Copy link

It would also be nice to backport this to 12.0.0 if possible, since right now we can't upgrade to 13.0.0 (see pandas-dev/pandas#55374 (comment)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence
4 participants